Background:

Sleep has always been a challenge for me. I struggle to fall asleep and am what some people would call a “night owl”. Because of this, I have always been fascinated by what makes for quality sleep. Sleep is essential for all humans. According to the Sleep Foundation (2024), getting quality sleep can help you concentrate, manage the effects of stress, make decisions, solve problems, heal your body, and fight infections and diseases. These are all things we need in our daily lives.

For this report, I wanted to see whether bedtime has any bearing on the quality of a person’s sleep and, if not, which factors are better indicators of good sleep.

Cleaning the Dataset:

I found the dataset “Sleep_Efficiency.csv” on kaggle.com. I cleaned the column names using the clean_names() function from the janitor package. The raw dataset contained a “bedtime” column that held both dates and times. I only wanted the time, so I separated that out. When I first tried plotting the bedtime data, everything sat on the far left and far right of the graph with nothing in the middle, because none of the subjects went to bed during the day. What I wanted was for the times to wrap from the evening into the early morning so that they form one continuous range. To do this, I added 24 hours to any time earlier than 6:00 p.m. (for example, a 1:00 a.m. bedtime is stored as 2500 in the cleaned data).
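The wrapping step described above can be sketched as follows. This is a minimal illustration, not the original cleaning code: the raw timestamp format, the function name wrap_bedtime(), and the example values are assumptions, and the HHMM encoding is inferred from the cleaned bedtime values shown in this report.

```r
# Hypothetical raw values; the real column format in Sleep_Efficiency.csv may differ.
raw_bedtime <- c("2021-03-06 01:00:00", "2021-12-05 02:00:00", "2021-05-25 21:30:00")

# Convert a timestamp to a numeric HHMM value, adding 24 hours (2400)
# to any time earlier than the cutoff so that early-morning bedtimes
# sort after late-evening ones.
wrap_bedtime <- function(x, cutoff_hour = 18) {
  t <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  hour <- as.integer(format(t, "%H"))
  minute <- as.integer(format(t, "%M"))
  hhmm <- hour * 100 + minute
  ifelse(hour < cutoff_hour, hhmm + 2400, hhmm)
}

wrap_bedtime(raw_bedtime)
```

With this encoding, 9:30 p.m. stays at 2130 while 1:00 a.m. and 2:00 a.m. become 2500 and 2600, so the values increase steadily through the night.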

The dataset recorded the percentage of time the subjects spent in each sleep stage (Deep, Light, and REM). These were stored in separate columns, so I used pivot_longer() to reshape them into “sleep type” and “sleep type percentage” columns. As a result, each subject appears in three rows, one per sleep type. Once I was done cleaning the dataset, I saved the clean version to a new csv file, which is what I used in this report. After loading this file, I still needed to convert a few columns to the data types I wanted: I changed gender, smoking status, and sleep type into factors.
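The reshaping step can be sketched like this. The wide column names (rem_sleep_percentage and so on) and the objects wide, long, and labels are assumptions made for illustration; the two example subjects use the percentages shown for ids 1 and 2 in the cleaned data.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide layout: one row per subject, one column per sleep stage.
wide <- tibble(
  id = c(1, 2),
  rem_sleep_percentage = c(18, 19),
  deep_sleep_percentage = c(70, 28),
  light_sleep_percentage = c(12, 53)
)

# Lookup from the column-name prefix to the label used in the report.
labels <- c(rem = "REM", deep = "Deep", light = "Light")

long <- wide %>%
  pivot_longer(
    cols = ends_with("_sleep_percentage"),
    names_to = "sleep_type",
    names_pattern = "(.*)_sleep_percentage",
    values_to = "sleep_type_percentage"
  ) %>%
  mutate(sleep_type = factor(unname(labels[sleep_type])))

long
```

Each subject now occupies three rows, which is why the later models filter to a single sleep type before fitting.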

Let’s take a quick look at the cleaned dataset:

## Rows: 1,356
## Columns: 13
## $ id                    <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, …
## $ age                   <dbl> 65, 65, 65, 69, 69, 69, 40, 40, 40, 40, 40, 40, …
## $ gender                <fct> Female, Female, Female, Male, Male, Male, Female…
## $ sleep_duration        <dbl> 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 6.0…
## $ sleep_efficiency      <dbl> 0.88, 0.88, 0.88, 0.66, 0.66, 0.66, 0.89, 0.89, …
## $ awakenings            <dbl> 0, 0, 0, 3, 3, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 0, …
## $ caffeine_consumption  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 50, 50, 50, 0, 0, 0, …
## $ alcohol_consumption   <dbl> 0, 0, 0, 3, 3, 3, 0, 0, 0, 5, 5, 5, 3, 3, 3, 0, …
## $ smoking_status        <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, Y…
## $ exercise_frequency    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 3, 1, …
## $ sleep_type            <fct> REM, Deep, Light, REM, Deep, Light, REM, Deep, L…
## $ sleep_type_percentage <dbl> 18, 70, 12, 19, 28, 53, 20, 70, 10, 23, 25, 52, …
## $ bedtime               <dbl> 2500, 2500, 2500, 2600, 2600, 2600, 2130, 2130, …

I was interested to see whether bedtime would have any effect on sleep efficiency.

# Plot of bedtime effect on sleep efficiency
bed_plot1 <- df %>% 
  ggplot(aes(x=factor(bedtime), y=sleep_efficiency, color=factor(bedtime))) +
  geom_point() +
  scale_x_discrete(labels=c("9:00 p.m.", "9:30 p.m.", "10:00 p.m.", "10:30 p.m.", "11:00 p.m.", "12:00 a.m.",
                            "12:30 a.m.", "1:00 a.m.", "1:30 a.m.", "2:00 a.m.", "2:30 a.m.")) +
  labs(x="Bedtime", y="Sleep Efficiency Proportion", title="Scatterplot of Bedtime's Effect on Sleep Efficiency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=25),
        legend.position = "none")

# Interactive plot
ggplotly(bed_plot1)

I was surprised to see that bedtime does not appear to have any effect on sleep efficiency.

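I fit a model with glm() using all of the variables to see what did have an effect on sleep efficiency. With its default Gaussian family, glm() gives the same fit as lm(). Because each subject appears once per sleep type, I first filtered the data down to the REM rows so that the sleep type percentage column holds only the REM percentage.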

rem_df <- df %>% 
  filter(sleep_type == "REM")

mod1REM <- glm(data=rem_df, formula = sleep_efficiency ~ age + gender + 
             sleep_duration + awakenings + caffeine_consumption + 
             alcohol_consumption + smoking_status + exercise_frequency + sleep_type_percentage + bedtime)
tidy(mod1REM) %>% 
  kableExtra::kable() %>% 
  kableExtra::kable_classic(lightable_options = 'hover')
term                    estimate  std.error    statistic    p.value
(Intercept)            0.8653406  0.0911912    9.4892980  0.0000000
age                    0.0013841  0.0003856    3.5899924  0.0003744
genderMale             0.0082960  0.0108749    0.7628562  0.4460262
sleep_duration        -0.0023318  0.0055513   -0.4200353  0.6746989
awakenings            -0.0475231  0.0038185  -12.4454136  0.0000000
caffeine_consumption   0.0001605  0.0001775    0.9044667  0.3663256
alcohol_consumption   -0.0237291  0.0031003   -7.6537166  0.0000000
smoking_statusYes     -0.0781570  0.0106176   -7.3610714  0.0000000
exercise_frequency     0.0117639  0.0037580    3.1303375  0.0018822
sleep_type_percentage  0.0016225  0.0014384    1.1279679  0.2600508
bedtime               -0.0000210  0.0000322   -0.6519992  0.5147990

The percentage of time spent in REM sleep was not a significant predictor of sleep efficiency (p = 0.26). Next, I will see whether the percentage of deep sleep is.

deep_df <- df %>% 
  filter(sleep_type == "Deep") # Time in deep sleep significant. REM is not

mod1 <- lm(data=deep_df, formula = sleep_efficiency ~ age + gender + 
                sleep_duration + awakenings + caffeine_consumption + 
                alcohol_consumption + smoking_status + exercise_frequency + sleep_type_percentage + bedtime)
tidy(mod1) %>% 
  kableExtra::kable() %>% 
  kableExtra::kable_classic(lightable_options = 'hover')
term                    estimate  std.error    statistic    p.value
(Intercept)            0.5812434  0.0582704    9.9749417  0.0000000
age                    0.0011359  0.0002612    4.3491404  0.0000176
genderMale            -0.0033367  0.0073479   -0.4540953  0.6500216
sleep_duration         0.0017794  0.0037713    0.4718332  0.6373188
awakenings            -0.0323604  0.0026860  -12.0479598  0.0000000
caffeine_consumption   0.0003140  0.0001200    2.6161485  0.0092502
alcohol_consumption   -0.0078486  0.0022348   -3.5120300  0.0004986
smoking_statusYes     -0.0436934  0.0073560   -5.9398190  0.0000000
exercise_frequency     0.0069996  0.0025578    2.7365283  0.0065029
sleep_type_percentage  0.0051885  0.0002460   21.0909567  0.0000000
bedtime               -0.0000284  0.0000218   -1.3066559  0.1921261

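The percentage of deep sleep was significant, as were several other variables. Based on the table above, the variables that were significant at the 0.05 level are age, awakenings, caffeine consumption, alcohol consumption, smoking status, exercise frequency, and deep sleep percentage.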

A few of these results surprised me: bedtime, of course, but also that gender and sleep duration were not significant. I would have thought that the longer you slept, the better the quality of your sleep would be.
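The REM model above was fit with glm() and the deep sleep model with lm(). Since glm() defaults to a Gaussian family with an identity link, the two functions should give the same coefficients for the same formula and data. Here is a quick check of that, sketched on R's built-in mtcars data rather than the sleep data; fit_lm and fit_glm are hypothetical names.

```r
# Same formula, same data, two fitting functions.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars)  # family = gaussian() by default

# The coefficient estimates agree up to numerical tolerance.
all.equal(coef(fit_lm), coef(fit_glm))
```

This suggests the switch from glm() to lm() between the two models does not by itself explain any difference in their results; the difference comes from which sleep type was used.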

Visualization:

We will look at some of these relationships.

# Age
p1 <- deep_df %>% 
  ggplot(aes(x=age, y = sleep_efficiency, color=factor(age))) + 
  geom_jitter() +
  geom_smooth(color="steelblue", se=FALSE, method="lm") +
  theme_minimal() +
  labs(title="Age",
       x="Age",
       y="Sleep Efficiency") +
  theme(legend.position = "none")

# Awakenings
p2 <- deep_df %>% 
  filter(!is.na(awakenings)) %>% 
  ggplot(aes(x=awakenings, y = sleep_efficiency, color=factor(awakenings))) + 
  geom_jitter() +
  geom_smooth(color="steelblue", method="lm", se=FALSE) +
  theme_minimal() +
  labs(title="Awakenings",
       x="Awakenings",
       y="Sleep Efficiency") +
  theme(legend.position = "none") 
p1 + p2

# Smoking
p5 <- deep_df %>% 
  ggplot(aes(x=factor(smoking_status), y = sleep_efficiency, fill=factor(smoking_status))) +
  labs(title="Smoking Status",
       x="Smoking Status",
       y="Sleep Efficiency") +
  geom_violin() +
  theme_minimal() +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.position = "none") 

# Exercise
p6 <- deep_df %>% 
  filter(!is.na(exercise_frequency)) %>% 
  ggplot(aes(x=exercise_frequency, y = sleep_efficiency, color=factor(exercise_frequency))) + 
  geom_jitter() +
  geom_smooth(color="steelblue", method="lm", se=FALSE) +
  theme_minimal() +
  labs(title="Exercise Frequency",
       x="Exercise Frequency (weekly)",
       y="Sleep Efficiency") +
  theme(legend.position = "none") 
p5 + p6

# Deep Sleep
deep_df %>% 
  filter(!is.na(awakenings)) %>% 
  ggplot(aes(x=sleep_type_percentage, y = sleep_efficiency, color=factor(awakenings))) +
  geom_point() +
  geom_smooth(se=FALSE, method="lm") +
  theme_minimal() +
  facet_wrap(~factor(awakenings))

Modeling:

Let’s fit a few candidate models to see how well we can predict sleep efficiency:

mod1 <- lm(data=deep_df, formula = sleep_efficiency ~ bedtime)
mod2 <- lm(data=deep_df, formula = sleep_efficiency ~ 
             age + awakenings + sleep_type_percentage)
mod3 <- lm(data=deep_df, formula = sleep_efficiency ~ 
             age * awakenings * sleep_type_percentage)
mod4 <- lm(data=deep_df, formula = sleep_efficiency ~
             age + awakenings + alcohol_consumption + smoking_status +
             sleep_type_percentage)
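
Mod2 and Mod3 use the same three predictors but differ in the formula operator: + adds main effects only, while * also adds every interaction between them. As a small sketch, the expansion can be inspected with terms() without fitting anything (f_add and f_int are hypothetical names):

```r
# Additive formula (as in Mod2) versus fully crossed formula (as in Mod3).
f_add <- sleep_efficiency ~ age + awakenings + sleep_type_percentage
f_int <- sleep_efficiency ~ age * awakenings * sleep_type_percentage

# List the terms each formula expands to.
attr(terms(f_add), "term.labels")
attr(terms(f_int), "term.labels")
```

The crossed formula expands to the three main effects, the three two-way interactions, and one three-way interaction, so Mod3 estimates four more coefficients than Mod2.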

We will look at the mean squared error (MSE) of each of these models:

Mod1

mean(mod1$residuals^2)
## [1] 0.01786273

Mod2

mean(mod2$residuals^2)
## [1] 0.004803478

Mod3

mean(mod3$residuals^2)
## [1] 0.004494944

Mod4

mean(mod4$residuals^2)
## [1] 0.004279514
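
To make these values easier to read on the scale of sleep efficiency itself, the MSE can be converted to a root mean squared error (RMSE) by taking the square root. The sketch below simply re-enters the four MSE values printed above; mse and rmse are hypothetical names.

```r
# MSE values copied from the output above.
mse <- c(mod1 = 0.01786273, mod2 = 0.004803478,
         mod3 = 0.004494944, mod4 = 0.004279514)

# RMSE is in the same units as sleep efficiency (a proportion from 0 to 1).
rmse <- sqrt(mse)
round(rmse, 3)

# Name of the model with the smallest in-sample error.
names(which.min(mse))
```

On this measure alone, Mod4 has the smallest in-sample error, with Mod3 close behind and Mod1 (bedtime only) far worse than the rest.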

We will compare the models to see which is the best

compare_performance(mod1, mod2, mod3, mod4)
## When comparing models, please note that probably not all models were fit
##   from same data.
## # Comparison of Model Performance Indices
## 
## Name | Model |   AIC (weights) |  AICc (weights) |   BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
## ------------------------------------------------------------------------------------------------------
## mod1 |    lm |  -530.6 (<.001) |  -530.5 (<.001) |  -518.3 (<.001) | 0.021 |     0.019 | 0.134 | 0.134
## mod2 |    lm | -1070.2 (<.001) | -1070.1 (<.001) | -1049.9 (0.083) | 0.739 |     0.737 | 0.069 | 0.070
## mod3 |    lm | -1090.9 (0.997) | -1090.5 (0.996) | -1054.3 (0.752) | 0.756 |     0.752 | 0.067 | 0.068
## mod4 |    lm | -1079.5 (0.003) | -1079.2 (0.004) | -1051.3 (0.165) | 0.768 |     0.765 | 0.065 | 0.066
compare_performance(mod1, mod2, mod3, mod4) %>% 
  plot()
## When comparing models, please note that probably not all models were fit
##   from same data.

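From this comparison, Mod3 appears to be the best by AIC, AICc, and BIC, where it has the lowest values. Mod4, however, has a slightly higher R2 (0.768 vs. 0.756) and a slightly lower RMSE (0.065 vs. 0.067), so the two are close. The warning in the output also indicates that the models were not all fit to the same rows, most likely because some variables have missing values, so the comparison should be read as approximate.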
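The warning printed by compare_performance() above says the models may not all have been fit from the same data. One common cause is that lm() silently drops rows with missing values in whichever variables a given model uses. Below is a minimal sketch of that behavior and one way around it, using R's built-in airquality data rather than the sleep data; m_a, m_b, and complete are hypothetical names.

```r
# Ozone has missing values, so the second model is fit on fewer rows.
m_a <- lm(Temp ~ Wind, data = airquality)
m_b <- lm(Temp ~ Wind + Ozone, data = airquality)
c(nobs(m_a), nobs(m_b))

# Keep only rows that are complete for every variable used by any model,
# then refit so that both models see exactly the same observations.
complete <- airquality[complete.cases(airquality[, c("Temp", "Wind", "Ozone")]), ]
m_a2 <- lm(Temp ~ Wind, data = complete)
m_b2 <- lm(Temp ~ Wind + Ozone, data = complete)
c(nobs(m_a2), nobs(m_b2))
```

For this report, the analogous step would be to drop the rows of deep_df that have missing values in any modeled column before fitting Mod1 through Mod4. Whether that would change the ranking is not something this sketch can show.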

Predictions:

I used Mod3 to make predictions for some hypothetical data.

# Add predictions
df2 <- add_predictions(deep_df, mod3)

# Make hypothetical values for the independent variables
newdf <- data.frame(age = c(59, 32, 13, 43),
                    awakenings = c(3, 2, 1, 0),
                    sleep_type_percentage = c(49, 18, 45, 77))

# Make predictions
pred <- predict(mod3, newdata=newdf)

# New data frame
hyp_preds <- data.frame(age = newdf$age,
                        awakenings = newdf$awakenings,
                        sleep_type_percentage = newdf$sleep_type_percentage,
                        pred=pred)

# Add a new column showing whether a data point is real or hypothetical
df2$prediction_type <- "Real"
hyp_preds$prediction_type <- "Hypothetical"

# Join real data and hypothetical data (with model predictions)
fullpreds <- full_join(df2, hyp_preds)

Predictions plotted alongside real data

ggplot(fullpreds, aes(x = sleep_type_percentage, y = pred, color = prediction_type)) +
  geom_point(aes(y = sleep_efficiency), color = "Black") +
  geom_point() +
  theme_minimal() 

Conclusions:

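The model I chose (Mod3) seems to be fairly accurate, explaining about 76% of the variation in sleep efficiency (R2 = 0.756). Bedtime, on the other hand, was not a significant predictor. From this report, I can see that many factors contribute to an efficient night of sleep, including the number of awakenings, the percentage of deep sleep, age, alcohol consumption, smoking status, and exercise frequency.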

References

National Sleep Foundation. (2024). Why do we need sleep? Sleep Foundation. Retrieved from https://www.sleepfoundation.org/how-sleep-works/why-do-we-need-sleep